This vignette explains the common ways of using the
.SD variable in your data.table analyses.
It is an adaptation of this answer
given on StackOverflow.
What is .SD?

In the broadest sense, .SD is just shorthand for
capturing a variable that comes up frequently in the context of data
analysis. It can be understood to stand for Subset,
Selfsame, or Self-reference of the Data. That
is, .SD is in its most basic guise a reflexive
reference to the data.table itself – as we’ll see in
examples below, this is particularly helpful for chaining together
“queries” (extractions/subsets/etc using [). In particular,
this also means that .SD is itself a
data.table (with the caveat that it does not allow
assignment with :=).
The simpler usage of .SD is for column subsetting (i.e.,
when .SDcols is specified); as this version is much more
straightforward to understand, we’ll cover that first below. The
interpretation of .SD in its second usage, grouping
scenarios (i.e., when by = or keyby = is
specified), is slightly different, conceptually (though at core it’s the
same, since, after all, a non-grouped operation is an edge case of
grouping with just one group).
To give this a more real-world feel, rather than making up data,
let’s load some data sets about baseball from the Lahman database. In
typical R usage, we’d simply load these data sets from the
Lahman R package; in this vignette, we’ve pre-downloaded
them directly from the package’s GitHub page instead.
load('Teams.RData')
setDT(Teams)
Teams
load('Pitching.RData')
setDT(Pitching)
Pitching
Readers up on baseball lingo should find the tables’ contents
familiar; Teams records some statistics for a given team in
a given year, while Pitching records statistics for a given
pitcher in a given year. Please do check out the documentation and explore
the data yourself a bit before proceeding to familiarize yourself with
their structure.
.SD on Ungrouped Data

To illustrate what I mean about the reflexive nature of
.SD, consider its most banal usage:
Pitching[ , .SD]
That is, Pitching[ , .SD] has simply returned the whole
table, i.e., this was an overly verbose way of writing
Pitching or Pitching[]:
identical(Pitching, Pitching[ , .SD])
# [1] TRUE
In terms of subsetting, .SD is still a subset of the
data, it’s just a trivial one (the set itself).
.SDcols

The first way to impact what .SD contains is to limit the
columns included in .SD using the
.SDcols argument to [:
# W: Wins; L: Losses; G: Games
Pitching[ , .SD, .SDcols = c('W', 'L', 'G')]
This is just for illustration and was pretty boring. In addition to
accepting a character vector, .SDcols also accepts:
- a function, such as is.character, to filter columns by type
- patterns() to filter column names by regular expression (see ?patterns for more details)
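Both forms can be illustrated with a small toy table (hypothetical data, for illustration only; the column names are made up):

```r
library(data.table)

# toy table standing in for a real data set
DT = data.table(id = c('a', 'b'), x = 1:2, y = 3:4, x_pct = c(0.1, 0.2))

# a function: keep only the character columns
DT[ , .SD, .SDcols = is.character]     # keeps: id

# patterns(): keep columns whose names match a regex
DT[ , .SD, .SDcols = patterns('^x')]   # keeps: x, x_pct
```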
This simple usage lends itself to a wide variety of highly beneficial / ubiquitous data manipulation operations:
Column type conversion is a fact of life for data munging. Though fwrite
recently gained the ability to declare the class of each column up
front, not all data sets come from fread (e.g. in this
vignette) and conversions back and forth among
character/factor/numeric types
are common. We can use .SD and .SDcols to
batch-convert groups of columns to a common type.
We notice that the following columns are stored as
character in the Teams data set, but might
more logically be stored as factors:
# teamIDBR: Team ID used by Baseball Reference website
# teamIDlahman45: Team ID used in Lahman database version 4.5
# teamIDretro: Team ID used by Retrosheet
fkt = c('teamIDBR', 'teamIDlahman45', 'teamIDretro')
# confirm that they're stored as `character`
str(Teams[ , ..fkt])
# Classes 'data.table' and 'data.frame': 2895 obs. of 3 variables:
# $ teamIDBR : chr "BOS" "CHI" "CLE" "KEK" ...
# $ teamIDlahman45: chr "BS1" "CH1" "CL1" "FW1" ...
# $ teamIDretro : chr "BS1" "CH1" "CL1" "FW1" ...
# - attr(*, ".internal.selfref")=<externalptr>
The syntax to now convert these columns to factor is
simple:
Teams[ , names(.SD) := lapply(.SD, factor), .SDcols = patterns('teamID')]
# print out the first column to demonstrate success
head(unique(Teams[[fkt[1L]]]))
# [1] BOS CHI CLE KEK NYU ATH
# 101 Levels: ALT ANA ARI ATH ATL BAL BLA BLN BLU BOS BRA BRG BRO BSN BTT ... WSN
Note:
- := is an assignment operator that updates the data.table in place without making a copy. See reference semantics for more.
- names(.SD) indicates which columns we are updating - in this case, we update the entire .SD.
- lapply() loops through each column of .SD and converts it to a factor.
- .SDcols = patterns('teamID') selects only the columns whose names match the pattern teamID.

Again, the .SDcols argument is quite flexible; above, we
supplied patterns but we could have also supplied
fkt or any character vector of column names.
In other situations, it is more convenient to supply an
integer vector of column positions or a
logical vector dictating include/exclude for each column.
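A minimal sketch of both of these forms, again on a toy table (hypothetical data):

```r
library(data.table)

DT = data.table(w = 1:3, x = 4:6, y = 7:9, z = 10:12)

# integer positions: keep the first and third columns
DT[ , .SD, .SDcols = c(1L, 3L)]                     # keeps: w, y

# logical flags, one include/exclude decision per column
DT[ , .SD, .SDcols = c(TRUE, FALSE, FALSE, TRUE)]   # keeps: w, z
```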
Finally, the use of a function to filter columns is very helpful.
For example, we could do the following to convert all
factor columns to character:
fct_idx = Teams[, which(sapply(.SD, is.factor))] # column numbers to show the class changing
str(Teams[[fct_idx[1L]]])
# Factor w/ 7 levels "AA","AL","FL",..: 4 4 4 4 4 4 4 4 4 4 ...
Teams[ , names(.SD) := lapply(.SD, as.character), .SDcols = is.factor]
str(Teams[[fct_idx[1L]]])
# chr [1:2895] "NA" "NA" "NA" "NA" "NA" "NA" "NA" "NA" "NA" "NA" "NA" "NA" ...
Lastly, we can use pattern-based matching of columns in
.SDcols to select all columns which contain
team and convert them back to factor:
Teams[ , .SD, .SDcols = patterns('team')]
Teams[ , names(.SD) := lapply(.SD, factor), .SDcols = patterns('team')]
A proviso to the above: explicitly using column numbers
(like DT[ , (1) := rnorm(.N)]) is bad practice and can lead
to silently corrupted code over time if column positions change. Even
implicitly using numbers can be dangerous if we don’t keep smart/strict
control over the ordering of when we create the numbered index and when
we use it.
Varying model specification is a core feature of robust statistical
analysis. Let’s try to predict a pitcher’s ERA (Earned Run Average, a
measure of performance) using the small set of covariates available in
the Pitching table. How does the (linear) relationship
between W (wins) and ERA vary depending on
which other covariates are included in the specification?
Here’s a short script leveraging the power of .SD which
explores this question:
# this generates a list of the 2^k possible extra variables
# for models of the form ERA ~ G + (...)
extra_var = c('yearID', 'teamID', 'G', 'L')
models = unlist(
  lapply(0L:length(extra_var), combn, x = extra_var, simplify = FALSE),
  recursive = FALSE
)
# here are 16 visually distinct colors, taken from the list of 20 here:
# https://sashat.me/2017/01/11/list-of-20-simple-distinct-colors/
col16 = c('#e6194b', '#3cb44b', '#ffe119', '#0082c8',
'#f58231', '#911eb4', '#46f0f0', '#f032e6',
'#d2f53c', '#fabebe', '#008080', '#e6beff',
'#aa6e28', '#fffac8', '#800000', '#aaffc3')
par(oma = c(2, 0, 0, 0))
lm_coef = sapply(models, function(rhs) {
# using ERA ~ . and data = .SD, then varying which
# columns are included in .SD allows us to perform this
# iteration over 16 models succinctly.
# coef(.)['W'] extracts the W coefficient from each model fit
Pitching[ , coef(lm(ERA ~ ., data = .SD))['W'], .SDcols = c('W', rhs)]
})
barplot(lm_coef, names.arg = sapply(models, paste, collapse = '/'),
main = 'Wins Coefficient\nWith Various Covariates',
col = col16, las = 2L, cex.names = 0.8)
Fit OLS coefficient on W, various specifications, depicted as bars with distinct colors.
The coefficient always has the expected sign (better pitchers tend to have more wins and fewer runs allowed), but the magnitude can vary substantially depending on what else we control for.
data.table syntax is beautiful for its simplicity and
robustness. The syntax x[i] flexibly handles three common
approaches to subsetting – when i is a logical
vector, x[i] will return those rows of x
corresponding to where i is TRUE; when
i is another data.table (or a
list), a (right) join is performed (in the
plain form, using the keys of x and
i, otherwise, when on = is specified, using
matches of those columns); and when i is a character, it is
interpreted as shorthand for x[list(i)], i.e., as a
join.
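The three forms of i can be sketched side by side on a small keyed toy table (hypothetical data):

```r
library(data.table)

x = data.table(id = c('a', 'b', 'c'), v = 1:3, key = 'id')

# 1) logical vector: keep the rows where the condition is TRUE
x[v > 1L]      # rows with id 'b' and 'c'

# 2) data.table/list: a (right) join, here using x's key
x[.('b')]      # the row where id == 'b'

# 3) character: shorthand for x[list(i)], i.e. the same join
x['b']
```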
This is great in general, but falls short when we wish to perform a conditional join, wherein the exact nature of the relationship among tables depends on some characteristics of the rows in one or more columns.
This example is admittedly a tad contrived, but illustrates the idea; see here (1, 2) for more.
The goal is to add a column team_performance to the
Pitching table that records the team’s performance (rank)
of the best pitcher on each team (as measured by the lowest ERA, among
pitchers with at least 6 recorded games).
# to exclude pitchers with exceptional performance in a few games,
# subset first; then define rank of pitchers within their team each year
# (in general, we should put more care into the 'ties.method' of frank)
Pitching[G > 5, rank_in_team := frank(ERA), by = .(teamID, yearID)]
Pitching[rank_in_team == 1, team_performance :=
Teams[.SD, Rank, on = c('teamID', 'yearID')]]
Note that the x[y] syntax returns nrow(y)
values (i.e., it’s a right join), which is why .SD is on
the right in Teams[.SD] (since the RHS of :=
in this case requires nrow(Pitching[rank_in_team == 1])
values).
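The nrow(y) point can be checked directly with two toy tables (hypothetical data): every row of y produces exactly one output row, with NA where there is no match in x.

```r
library(data.table)

x = data.table(id = c('a', 'b'), xv = 1:2)
y = data.table(id = c('b', 'c', 'b'), yv = 1:3)

# x[y] is a right join: one output row per row of y,
# with NA in x's columns where y's id has no match in x
res = x[y, on = 'id']
nrow(res)   # 3, i.e. nrow(y)
</imports>
```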
Grouped .SD operations

Often, we’d like to perform some operation on our data at the
group level. When we specify by = (or
keyby =), the mental model for what happens when
data.table processes j is to think of your
data.table as being split into many component
sub-data.tables, each of which corresponds to a single
value of your by variable(s):
Grouping, Illustrated
In the case of grouping, .SD is multiple in nature – it
refers to each of these sub-data.tables,
one-at-a-time (slightly more accurately, the scope of
.SD is a single sub-data.table). This allows
us to concisely express an operation that we’d like to perform on
each sub-data.table before the re-assembled result
is returned to us.
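A minimal sketch of this mental model on a toy table (hypothetical data): within each group, .SD is that group's sub-data.table, so lapply(.SD, mean) takes per-group means of every non-grouping column.

```r
library(data.table)

DT = data.table(g = c('a', 'a', 'b'), x = c(1, 3, 5), y = c(2, 4, 6))

# for each value of g, .SD holds the matching rows of x and y;
# the per-group results are then stacked back together
DT[ , lapply(.SD, mean), by = g]
# group a: x = 2, y = 3; group b: x = 5, y = 6
```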
This is useful in a variety of settings, the most common of which are presented here:
Let’s get the most recent season of data for each team in the Lahman data. This can be done quite simply with:
# the data is already sorted by year; if it weren't
# we could do Teams[order(yearID), .SD[.N], by = teamID]
Teams[ , .SD[.N], by = teamID]
Recall that .SD is itself a data.table, and
that .N refers to the total number of rows in a group (it’s
equal to nrow(.SD) within each group), so
.SD[.N] returns the entirety of .SD
for the final row associated with each teamID.
Another common version of this is to use .SD[1L] instead
to get the first observation for each group, or
.SD[sample(.N, 1L)] to return a random row for
each group.
Suppose we wanted to return the best year for each team, as
measured by their total number of runs scored (R; we could
easily adjust this to refer to other metrics, of course). Instead of
taking a fixed element from each sub-data.table,
we now define the desired index dynamically as follows:
Teams[ , .SD[which.max(R)], by = teamID]
Note that this approach can of course be combined with
.SDcols to return only portions of the
data.table for each .SD (with the caveat that
.SDcols should be fixed across the various subsets)
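A sketch of this combination on a toy stand-in for Teams (hypothetical values); note that R itself must appear in .SDcols so that which.max(R) can find it inside .SD:

```r
library(data.table)

DT = data.table(teamID = c('A', 'A', 'B', 'B'),
                yearID = c(2000L, 2001L, 2000L, 2001L),
                R      = c(700L, 800L, 650L, 600L))

# best-R season per team, returning only the columns named in .SDcols
DT[ , .SD[which.max(R)], by = teamID, .SDcols = c('yearID', 'R')]
```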
NB: .SD[1L] is currently optimized by GForce
(see
also), data.table internals which massively speed up
the most common grouped operations like sum or
mean – see ?GForce for more details and keep
an eye on/voice support for feature improvement requests for updates on
this front: 1, 2, 3, 4, 5, 6
Returning to the inquiry above regarding the relationship between
ERA and W, suppose we expect this relationship
to differ by team (i.e., there’s a different slope for each team). We
can easily re-run this regression to explore the heterogeneity in this
relationship as follows (noting that the standard errors from this
approach are generally incorrect – the specification
ERA ~ W*teamID will be better – this approach is easier to
read and the coefficients are OK):
# Overall coefficient for comparison
overall_coef = Pitching[ , coef(lm(ERA ~ W))['W']]
# use the .N > 20 filter to exclude teams with few observations
Pitching[ , if (.N > 20L) .(w_coef = coef(lm(ERA ~ W))['W']), by = teamID
][ , hist(w_coef, 20L, las = 1L,
xlab = 'Fitted Coefficient on W',
ylab = 'Number of Teams', col = 'darkgreen',
main = 'Team-Level Distribution\nWin Coefficients on ERA')]
abline(v = overall_coef, lty = 2L, col = 'red')
A histogram depicting the distribution of fitted coefficients. It is vaguely bell-shaped and concentrated around -.2
While there is indeed a fair amount of heterogeneity, there’s a distinct concentration around the observed overall value.
All of this is just a brief introduction to the power
of .SD in facilitating beautiful, efficient code in
data.table!